MultiThreading Benchmarks
Roy Longbottom
CPU Speed Via Assembly Language Add Instructions

The benchmarks use an integer test and a floating point test. These are first run separately, then together in two or more threads, with speeds measured in integer MIPS or MFLOPS. Below are example log files for a quad core, 8 thread Core i7 CPU. Results are available in PC CPUID 1994 to 2013, plus Measured Maximum Speeds Via Assembler Code.pdf. The separate tests indicate three integer MIPS per MHz and (nearly) the expected maximum of four SSE floating point adds per clock cycle. They also show significantly higher throughput with eight threads than with four.

First Windows Versions (obsolete) - cpuidmp.exe and cpuidMP64.exe are included in dualcore.zip with source code in newsource.zip. These covered 1, 2 and 4 threads.

Later Windows Versions - cpuid8thread32.exe and cpuid8Thread64.exe and source code can be found in quadcore.zip. Further details and results are included in quad core 8 thread.htm. Test functions measure performance using 1, 2, 4, 6 and 8 threads.

Linux Versions - cpumaxmp32 and cpumaxmp64 with source code in linux_multithreading_apps.tar.gz. The benchmarks accept an input parameter for 1, 2, 4, 8, 16, 32 or 64 threads (example command ./cpumaxmp32 Threads 8). The default is the detected count, such as 8 for a quad core CPU with hyperthreading. Further details and results can be found in linux multithreading benchmarks.htm. This variety has separate tests for integer and floating point calculations at the designated thread count.
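The thread count parameter described above can be illustrated with a short sketch. This is not the benchmark's own code (which is not shown here): the function name `thread_count`, the case-insensitive match on "Threads", and the fallback to the detected count when the value is not one of the listed sizes are all assumptions made for illustration.

```python
import os

# Thread counts the text lists as accepted by the Linux benchmarks.
ALLOWED_THREADS = (1, 2, 4, 8, 16, 32, 64)

def thread_count(args, detected=None):
    """Return the thread count from arguments such as ['Threads', '8'].

    Hypothetical helper: falls back to the detected logical CPU count
    when no valid 'Threads N' pair is present (assumed behaviour).
    """
    if detected is None:
        detected = os.cpu_count() or 1
    for name, value in zip(args, args[1:]):
        if name.lower() == "threads" and value.isdigit():
            requested = int(value)
            if requested in ALLOWED_THREADS:
                return requested
    return detected

# Equivalent of "./cpumaxmp32 Threads 8" on any machine.
print(thread_count(["Threads", "8"]))
```

With no arguments, the sketch returns whatever `os.cpu_count()` reports, which would be 8 on a quad core CPU with hyperthreading.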
Windows 1 to 8 Threads - 64 Bit Version

MP Bus Speed Test 64 bit Version 2.0 Sat May 10 11:57:03 2014

Part 1 - 1 Thread MBytes/Second
                                                                            32 bit
 Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2  ReadAll 128bSSE2

      6    31565    31291    31178    42042    42508    41978    61606    21375    61610
     24    31300    31285    31258    42203    42786    41751    62331    21157    62329
     96     5375     5559     5793    11083    20009    34332    40516    20363    40673
    384     5562     5658     5864    11338    19966    33244    39317    20679    38413
    768     5331     5391     5505    10966    19403    32805    37871    20680    38718
   1536     5364     5427     5508    10779    19355    33166    37951    20679    38331
  16380     1070     1356     1955     4248     8046    16688    16838    14103    16757
 131070     1034     1272     1866     4023     7724    16029    15980    13852    15963

Part 2 - 2 Thread MBytes/Second

      6    63147    62371    62552    83983    85074    83689   123233    42597   123206
     24    62579    62580    62188    84353    85351    83515   124250    42252   124188
     96    10779    10875    11473    21904    39332    67550    80624    40717    80597
    384    10088    11391    11560    22649    39705    67022    78033    41352    76206
    768    10574    10610    11042    21889    38669    65967    76066    41356    77275
   1536    10442    10637    10901    21597    38467    66046    75829    41353    76302
  16380     1798     2305     3397     7161    13913    28647    28743    25980    28471
 131070     1780     2310     3424     7193    13808    28589    28617    26066    28578

Part 3 - 4 Thread MBytes/Second

      6   116410   124710    92330   167023   148833   165596   245603    70155   238644
     24   124722   124658    96440   143956   153894   165793   248402    67455   225894
     96    21213    21636    20486    39631    73042   115995   159914    74935   123866
    384    21720    22354    22996    44788    79335   111720   155599    76795   128063
    768    18098    19577    21168    41296    71833   128568   126837    75598   138878
   1536    13887    19117    20564    37334    73001   126388   143677    74219   129958
  16380     2113     2780     4682     9428    18500    36759    37534    36126    37098
 131070     2109     2598     4681     8806    18112    37049    37477    36384    35472

Part 4 - 6 Thread MBytes/Second

      6   118438   106222   105201   161860   157529   178558   295443    88245   309920
     24    89228    71127    80985   110402   127049   167495   216617    87712   228035
     96    17634    19432    18990    38043    68990   111843   134485    83460   143356
    384    18645    18932    19929    42970    76220   123858   138682    83237   142146
    768    18072    17529    19655    40544    65312   124557   132566    79248   141036
   1536    14363    16097    18084    35815    59434   104533   128989    73640   123287
  16380     2043     2763     4568     9273    18501    36749    36798    36852    36663
 131070     2082     2689     4508     9093    18033    35246    36318    36347    36784

Part 5 - 8 Thread MBytes/Second

      6   124479   125263   124774   196833   206725   212245   392939   107411   402166
     24    53893    57161    59948    89256   129520   173683   263250   100645   259380
     96    21217    21589    22492    44013    84359   147831   165343    98906   164050
    384    21016    21622    22726    43780    80221   147095   165442    98539   161937
    768    19382    20258    21737    42635    80814   144343   159745    98558   160982
   1536     9986    10664    12858    24661    49622    83158    93985    60140    92112
  16380     2074     2748     4525     9123    18245    36548    36486    36504    36414
 131070     2072     2759     4525     9123    18216    36571    36445    36481    36443

Linux Results Next
The Windows versions, and the initial Linux programs, were arranged so that all threads start by reading data from the beginning. This did not appear to raise any issues under Windows but it clearly did so under Linux. This became particularly noticeable on later CPUs, such as the Core i7 reported on here, with a 10 MB shared L3 cache. Maximum memory data transfer speed of this PC is 51.2 GB/second.
The first results below are for Version 1, single thread, at 64 bits and 32 bits, with performance similar to the Windows versions, that is, faster integer MB/second via caches at 64 bits. The other results are for 64 bit Version 2, where performance is quite similar to the Windows (Version 1) speeds. [Ignore 6 KB speeds - needs a longer test.] The last two columns are for Linux Version 1 results, where RAM speeds are shown to be faster than the 51.2 GB/second specification, due to caching effects. In Version 2, each thread reads all the data but at staggered starting points, and additional RAM is read, to prove the point. Now a maximum of 40.6 GB/second is shown, at 4 threads, 2.2 times faster than with one thread.
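The staggered starting points used in Version 2 can be pictured with a small sketch. The benchmark's actual offsets are not given in the text, so the even spacing used here, and the function names `staggered_starts` and `read_order`, are assumptions for illustration only. The point is that every thread still reads all the data, but no two threads begin on the same words, so one thread cannot simply reuse what another has just brought into the shared cache.

```python
def staggered_starts(words, threads):
    """Starting word index for each thread, spread evenly over the array.

    Even spacing is an assumption; the benchmark may use other offsets.
    """
    return [(t * words) // threads for t in range(threads)]

def read_order(words, start):
    """Indices one thread visits: every word once, wrapping at the end."""
    return [(start + i) % words for i in range(words)]

# Tiny example: 16 words shared by 4 threads.
for t, start in enumerate(staggered_starts(16, 4)):
    order = read_order(16, start)
    print(f"thread {t}: starts at {start}, first words {order[:4]}")
```

In Version 1 terms, every thread would have a start of 0, which is the arrangement that produced RAM speeds above the 51.2 GB/second specification.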
MP Bus Speeds 64 bit Version 1.0, 1 Threads, Sun Oct 22 14:03:08 2017
                                                                            32 bit
 Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2  ReadAll 128bSSE2

      6    31168    31240    31189    42408    43371    43670    61517    20443    61544
     24    31267    31251    31217    42139    43348    43816    62254    20787    62259
     96    13627    14374    15240    24228    32977    40497    60299    20459    60286
    384     5556     5707     5797    11305    20134    34224    39990    20366    41534
    768     5348     5442     5555    10923    19356    33585    38201    20385    38255
   1536     5311     5421     5555    10924    19385    33698    38255    20421    38362
  16380     1240     1564     2130     4671     9149    18280    18843    16515    19109
 131070     1201     1469     2098     4573     8500    18137    18128    16472    17808
 393210     1155     1453     2098     4557     8112    18145    17813    15913    18024

MP Bus Speeds 64 bit Version 2.0, 1 Threads, Sun Oct 22 14:06:58 2017
                                                                          Version 1
 Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2  ReadAll 128bSSE2

      6    31594    31270    31258    41133    36625    41267    61563    43670    61517
     24    31283    31252    31211    42440    38461    42184    62258    43816    62254
     96    14896    15334    15560    24390    32204    39245    60395    40497    60299
    384     5703     5835     5988    11721    20542    34338    40726    34224    39990
    768     5389     5468     5563    10924    19334    33585    38159    33585    38201
   1536     5365     5453     5564    10925    19339    33598    38187    33698    38255
  16380     1285     1562     2179     4795     8882    18631    19165    18280    18843
 131070     1225     1453     2096     4460     8528    18195    18187    18137    18128
 393210     1225     1454     2077     4477     8703    18188    18059    18145    17813
 786420     1230     1450     2051     4576     8571    17598    18190
1572840     1216     1486     2086     4583     8647    18109    18427

2 Threads

      6    29512    30019    59608    60056    69436    80271   102281    84136   123044
     24    59225    59487    58806    83693    75177    83373   124495    86728   121640
     96    20250    21156    21937    38565    59794    76975   120371    80333   121121
    384    10653    10963    11272    21556    38987    59334    80732    65431    82328
    768    10087    10384    10637    19731    36985    63797    75626    63587    76116
   1536    10103    10435    10729    20807    37071    63898    76338    63838    76340
  16380     2628     3222     4158     8358    15989    32486    33558    32248    33552
 131070     1968     2585     3803     8004    15471    31863    32579    32166    33354
 393210     1969     2594     3825     7570    15511    31911    32714    32125    33558
 786420     1966     2592     3722     7989    15429    32025    32676
1572840     1970     2593     3839     8112    15467    32103    32767

4 Threads

      6    25920    29754    58965    64123    95935   147826   260224   167038   205273
     24   114028   118093   119688   117904   114405   163844   244665   173073   243044
     96    42412    42912    43013    75571   119669   154540   240629   160370   241160
    384    20903    21781    22653    42992    77420   128661   163280   127537   159648
    768    19201    19029    20653    39706    72719   117327   151481   125191   151515
   1536    18637    19725    20659    39744    73196   101971   151967   125482   151584
  16380     6026     6764     8179    14740    28176    54888    58802    57175    61785
 131070     2034     3088     5019    10004    19712    38982    40418    52960    61816
 393210     2033     3099     4303    10048    19856    39126    40572    57642    53405
 786420     2068     3092     5050    10077    19819    39096    40628
1572840     2032     2858     4348     9412    19851    39157    39699

8 Threads

      6    10245    11452    24238    46432    91436    85135   216659   151955   278208
     24    42877    46747    90912    92228   124711   142776   283743   150852   298146
     96    36838    44259    43458    80107   122566   136226   193969   138749   276197
    384    23488    22078    28973    53186    85603   138786   176651   122014   206507
    768    21820    25828    27393    38557    79105   149178   190188    95956   162380
   1536    20182    21804    25304    40594    72493   120503   155289   112027   177947
  16380     6786     7686     9822    19679    35524    59894    73745    64625    65317
 131070     3015     3832     4361     9619    19162    39564    38654    47164    46280
 393210     2390     3176     4901     9995    19884    39652    42583    50841    51818
 786420     2300     3045     4821    10165    19444    38217    38839
1572840     2032     2992     4792     9680    19259    38238    38778
First Windows Versions (obsolete) - RandMP32.exe and RandMP64.exe are also in dualcore.zip with source code in newsource.zip and further details in randmem results.htm.

Later Windows Versions (1 to 8 threads) - Rand8Thread32.exe and Rand8Thread64.exe are available in quadcore.zip, with further details included in quad core 8 thread.htm.

Linux Versions - MPrandmem32 and MPrandmem64 can be found in linux_multithreading_apps.tar.gz. They have the same run time format as the above Linux benchmarks for up to 64 threads. Further details can be found in linux multithreading benchmarks.htm.
The Linux benchmark has additional Mutex tests that restrict updating access to one thread at a time. The effect appears to produce some faster speeds with cached data but slower speeds from RAM. With the other procedures, multithreading performance gains and losses differ between the Windows and Linux compilations.
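The idea behind the Mutex tests, restricting updating access to one thread at a time, can be sketched as follows. This is an illustration in Python rather than the benchmark's compiled code; the function name `run_updates`, the array size and the index pattern are hypothetical, and the sketch demonstrates correctness of serialised updates, not the speeds reported in the logs.

```python
import threading

def run_updates(threads, updates_per_thread, words=1024):
    """Each thread performs read-modify-write updates under one shared lock."""
    data = [0] * words
    lock = threading.Lock()

    def worker(tid):
        for i in range(updates_per_thread):
            index = (tid * 7919 + i) % words  # arbitrary illustrative spread
            with lock:                        # only one updater at a time
                data[index] += 1

    pool = [threading.Thread(target=worker, args=(t,)) for t in range(threads)]
    for th in pool:
        th.start()
    for th in pool:
        th.join()
    return sum(data)

# Every update is counted exactly once, whatever the thread count.
print(run_updates(threads=4, updates_per_thread=1000))  # prints 4000
```

Because the lock serialises the updates, adding threads cannot add update throughput, which is consistent with the Mutex SRW and Mutex RRW rows staying close to the single thread speeds from RAM.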
Windows 1 to 8 Threads - 64 Bit Version

RandMP 8 Thread Write/Read Test 64 bit Ver. 2.0 Sat May 10 14:38:49 2014

           ------------------ MBytes Per Second At --------------------
              6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

1 Thread
Serial RD    30475   30602   18350   17605   17557   17595   12405   11243
Serial RW    30195   30175   22013   17576   17469   17482   11531   10642
Random RD    28916   29109   13124    8290    6319    5669    1308     655
Random RW    30726   29935    9498    6161    5232    4813    1185     608

2 Threads Total
Serial RD    61153   60840   36843   35338   35231   35297   23339   21157
Serial RW    21994   21510   20967   21670   33037   33428   23256   21508
Random RD    57862   57902   26154   16611   12484   11248    2622    1302
Random RW     3761    4658    5132    6599    6963    7114    2399    1282

4 Threads Total
Serial RD   116765  120919   73499   62205   70973   70902   45280   41568
Serial RW    20776   31110   38023   42715   43836   65636   47000   42876
Random RD   110503  115241   52011   32996   24884   22294    5247    2540
Random RW     3324    6532    8197   11159   11724   12507    4747    2494

8 Threads Total
Serial RD   111370  114213   95358   92240   89120   87557   74104   63754
Serial RW    28212   37141   54805   64501   56425   72723   70007   49286
Random RD   108353  110797   59991   41932   32190   14669    4878    2897
Random RW     5150    8024    9153   17569   15918   13841    4661    2528

Linux 1 to 8 Threads - 64 Bit Version

RandMemMP Speeds 64 Bit Version 1, X Threads, Sun Oct 22 15:00:43 2017

           ------------------ MBytes Per Second At --------------------
              6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

1 Thread
Serial RD    27991   27801   20258   19249   19249   19294   12477   11683
Serial RW    29969   30241   21896   17829   17494   17499   12085   11565
Random RD    27484   27463   13589    8257    6220    5604    2471    1011
Random RW    30364   30075    9168    6108    5177    4783    2804     982
Mutex SRW    29982   30245   21897   17762   17433   17432   12130   11529
Mutex RRW    30361   30071    9176    6108    5175    4782    2772     982

2 Threads
Serial RD    40622   55523   40299   38028   37866   37878   23094   22142
Serial RW    14539   21855   20979   22448   31456   25642   24743   18109
Random RD    40316   54307   26840   16365   12340   11092    4747    1913
Random RW     3039    4599    5107    6570    6943    7115    4904    1773
Mutex SRW    15294   29770   21777   17761   17385   17130   12099   11298
Mutex RRW    22396   29829    9251    6098    5174    4779    2817     970

4 Threads
Serial RD    39300  106376   80250   75904   75310   75408   43206   37738
Serial RW    15182   31547   35603   38859   45426   60180   48848   20287
Random RD    72790  104282   52951   31312   12640   21975    6813    3317
Random RW     2582    5910    8171   11159    9140   12510    9591    3261
Mutex SRW    20566   29383   21517   18150   16703   16945   11798   11177
Mutex RRW    22006   29629    8880    5881    5035    4666    2702     967

8 Threads
Serial RD    37987   76974   96575   94809   88112   88170   66556   60949
Serial RW     9030   29524   52796   47811   52557   69516   68200   25318
Random RD    37120   76419   65662   32215   24619   22463   13226    3346
Random RW     2013    6036    9032   17133   16426   15039   11082    2829
Mutex SRW     8207   17043   20147   17135   16675   16621   11714   10827
Mutex RRW     9865   20828    8613    5574    4889    4567    2676     951
Windows Versions - MPmflops32.exe using 32 bit instructions, MPmflops64.exe with SSE instructions, MPmflopsc2.exe, a later 64 bit SSE compilation for full SIMD operation, and MPmflopsAVX.exe, a 64 bit compilation using the /arch:AVX option. The benchmarks and source code are available in gigaflops-benchmarks.zip, with further details and results in GigaFLOPS Benchmarks.htm. All were compiled from the same code to handle up to 64 threads (Command Format Example - MPmflopsc2 Threads 8).
Linux Versions - MPmflops32, MPmflops32SSE and MPmflops64, where benchmarks and source code are also in linux_multithreading_apps.tar.gz, again for up to 64 threads. Further details and results can be found in linux multithreading benchmarks.htm. Later, MPmflops64AVX was produced and is in AVX_benchmarks.tar.gz, with details in Linux AVX_benchmarks.htm.
Results for runs on Windows and Linux are below. The first Windows results are from a compilation for old i87 32 bit floating point. The second had a compiler directive to use SSE functions, but only achieved Single Instruction Single Data (SISD) operation, using one word out of the 4 word registers, and was slightly faster in the early tests. The third are from the later SSE compilation, with full SIMD operation. The fourth, with an AVX compiler directive, generated the appropriate vector instructions, but applied to SSE 128 bit registers, producing the same performance as the SSE tests.
Maximum SSE MFLOPS per core are equal to CPU MHz x 4 (words per 128 bit SSE register) x 2 (linked multiply and add), or 31.2 GFLOPS for the 3.9 GHz Core i7 considered here, giving 124.8 GFLOPS for four cores. The 256 bit AVX registers double this score. Both Windows and Linux programs demonstrated respectable performance of more than 90 GFLOPS for SSE, with the Linux benchmark near 180 GFLOPS using AVX instructions.
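The peak figures above follow from simple arithmetic, reproduced here as a short sketch. The function name `peak_gflops` and its parameter names are chosen for illustration; the inputs (3900 MHz, 4 cores, 32 bit words, two operations per word for the linked multiply and add) are taken from the text and the result tables.

```python
def peak_gflops(mhz, cores, register_bits, word_bits=32, ops_per_word=2):
    """Theoretical peak: MHz x words per register x 2 (multiply + add) x cores."""
    lanes = register_bits // word_bits       # 4 for SSE, 8 for AVX
    return mhz * lanes * ops_per_word * cores / 1000.0

sse_one_core = peak_gflops(3900, 1, 128)     # 31.2 GFLOPS
sse_four_cores = peak_gflops(3900, 4, 128)   # 124.8 GFLOPS
avx_four_cores = peak_gflops(3900, 4, 256)   # 249.6 GFLOPS, double the SSE score

print(sse_one_core, sse_four_cores, avx_four_cores)
```

Against these limits, the measured results of more than 90 GFLOPS for SSE and near 180 GFLOPS for AVX both come out at roughly three quarters of the theoretical maximum.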
Windows MFLOPS 1 to 16 Threads

Operations Per Word        2       2       2       8       8       8      32      32      32
Million Words           0.10    1.02   10.24    0.10    1.02   10.24    0.10    1.02   10.24
             Threads

Core i7 4820K      1    3867    3853    3386    6085    6054    6017    5830    5824    5809
256 KB x 4 L2      2    7737    7731    6618   12160   12165   11991   11653   11648   11650
4 core 8 Thrd      4   15433   15459    9833   23487   24291   23886   22666   23175   23220
3900 MHz i87       8   15359   15395    9846   23554   23708   23586   23418   23464   23416
Windows i87       16   15145   15192   10023   23422   23536   22966   23241   23401   23282

Core i7 4820K      1    5004    4960    4192    6188    6182    6135    5890    5890    5887
256 KB x 4 L2      2    9996   10002    8049   12371   12354   12282   11770   11779   11744
4 core 8 Thrd      4   19923   18532    9866   23946   24704   24347   23219   23531   23497
3900 MHz           8   19602   19776    9820   24683   24648   24634   23521   23497   23506
Windows SISD      16   18727   19077   10073   24316   24243   24442   23469   23393   23385

Core i7 4820K      1   10116    9864    5852   24636   24436   19881   23353   23389   23243
256 KB x 4 L2      2   26453   19851    9189   49181   49223   34969   46653   46759   46414
4 core 8 Thrd      4   41845   26975   10063   85909   93852   40163   89202   90572   87329
3900 MHz           8   58734   43723    9980   97139   98446   40062   91320   93885   93125
Windows SIMD      16   57731   42194   10178   94166   93338   40074   90162   92102   93496

Core i7 4820K      1   10046    9901    5906   24629   24382   19832   23411   23361   23246
256 KB x 4 L2      2   26634   19679    9250   49194   49267   35183   46788   46788   46382
4 core 8 Thrd      4   52424   39057   10092   60266   98220   39744   90948   90611   92515
3900 MHz           8   58601   43529   10032   85198   98220   40162   93810   93866   93745
Windows AVX 1     16   57098   42920   10319   86267   95243   40427   92929   92995   92356

Linux MFLOPS 1 to 8 Threads

Operations Per Word        2       2       2       8       8       8      32      32      32
Million Words           0.10    1.02   10.24    0.10    1.02   10.24    0.10    1.02   10.24
             Threads

Core i7 64 bit     1    9681    9759    5990   24533   24570   19975   23269   23307   23052
4820K              4   45340   21688    9237   49320   49918   36638   46942   89676   91029
Linux SIMD SSE     8   54621   41832   10026   92086   92352   39982   92408   93282   92050

Core i7 64 bit     1   12542   11404    5991   35982   36180   23299   46400   46572   44729
4820K              4   62273   23031    8970  159040   80096   40124   90572   91058   88877
Linux SIMD AVX     8   60258   44329    9977  173224  151909   40153  173372  177831  158594
Windows OpenMP Benchmarks - OpenMP32MFLOPS.exe, SSE32MFLOPS.exe (same code, no OpenMP directives) and OpenMP64MFLOPS.exe are included in openmpmflops.zip. Further details and results are included in openmp mflops.htm. Different OpenMP benchmarks are covered in openmp speeds.htm.
With Visual Studio 2012, Microsoft added QPAR, the Auto-Parallelizer, to the compiler. This can automatically generate multiple threads in the same way as OpenMP. The benchmark QparMP64MFLOPS.exe was produced, with execution and source files included in gigaflops-benchmarks.zip, and details and results in GigaFLOPS Benchmarks.htm and quad core 8 thread.htm.
Linux Original Versions - openMPmflops32, openMPmflops64, notOMPmflops32 and notOMPmflops64, from linux openmp.tar.gz with details in linux openmp benchmarks.htm. Then there are Later Versions - openMPmflops64, notOMPmflops64 and openMPmflops64AVX in AVX_benchmarks.tar.gz, with details in Linux AVX_benchmarks.htm.
Results below are again from benchmarking the 3.9 GHz Core i7.
Windows OpenMP64MFLOPS.exe provides speeds similar to 64 bit MP-MFLOPS SISD at 32 operations per word; otherwise it is slower.
QparMP64MFLOPS.exe obtains 4 thread performance similar to MP-MFLOPS SIMD. QPAR appears to provide a better alternative to OpenMP but, overall, hand coded multithreading seems to be the best option.
Linux notOMPmflops64 V1 and V2 achieve speeds similar to the single thread MP-MFLOPS benchmark, but not to the 4 thread test, and particularly not to the one using 8 threads.
openMPmflops64AVX performance is generally inferior to that from Linux MPmflops64AVX.
Operations Per Word        2       2       2       8       8       8      32      32      32
Million Words           0.10    1.02   10.24    0.10    1.02   10.24    0.10    1.02   10.24

Windows
SSE32MFLOPS.exe         4898    4845    4171    5824    5994    6094    5796    5829    5795
OpenMP32MFLOPS.exe      6511    9290    9119   14351   17324   17592   21454   22884   22850
OpenMP64MFLOPS.exe      8420   12440    9483   18477   23210   23737   22134   18281   19690

QparMP64MFLOPS.exe
1 Thread                9691    9454    5743   23214   23126   19033   22700   23541   23405
2 Threads              23972   18673    9177   44855   44919   33868   44070   45733   46419
4 Threads              43356   36007   10084   76380   91259   40349   85300   81803   69212
8 Threads              44741   33966    9732   81506   73857   36635   87736   91170   87086

Linux
notOMPmflops64 V1      10093    9803    5919   24634   24651   20097   23519   23520   23339
openMPmflops64 V1       9084   12363    8089   22273   23039   22432   22683   23195   23096
notOMPmflops32 V2       3884    3886    3612    6145    6151    6067    5837    5835    5830
openMPmflops32 v2       9483   12481    8628   22347   23032   22742   22691   23247   23126
notOMPmflops64 V2       9879    9772    5934   24500   24529   20039   23285   23290   23090
openMPmflops64 V2      11163   20322    9180   45392   49695   33927   21534   22477   22476
openMPmflops64AVX      19713   37822    9219   94036   68725   36923   22761   23133   23019